Educational Technology
Improving Task-Specific Multimodal Sentiment Analysis with General MLLMs via Prompting
Multimodal Sentiment Analysis (MSA) aims to predict sentiment from diverse data types, such as video, audio, and language. Recent progress in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across various tasks. However, in MSA, the increase in computational costs does not always correspond to a significant improvement in performance, raising concerns about the cost-effectiveness of applying MLLMs to MSA. This paper introduces the MLLMGuided Multimodal Sentiment Learning Framework (MMSLF). It improves the performance of task-specific MSA models by leveraging the generalized knowledge of MLLMs through a teacher-student framework, rather than directly using MLLMs for sentiment prediction. First, the proposed teacher built upon a powerful MLLM (e.g., GPT-4o-mini), guides the student model to align multimodal representations through MLLM-generated context-aware prompts. Then, knowledge distillation enables the student to mimic the teacher's predictions, thus allowing it to predict sentiment independently without relying on the context-aware prompts. Extensive experiments on the SIMS, MOSI, and MOSEI datasets demonstrate that our framework enables task-specific models to achieve state-of-the-art performance across most metrics. This also provides new insights into the application of general MLLMs for improving MSA.1
Knowledge Starts with Practice: Knowledge-Aware Exercise Generative Recommendation with Adaptive Multi-Agent Cooperation
Adaptive learning, which requires the in-depth understanding of students' learning processes and rational planning of learning resources, plays a crucial role in intelligent education. However, how to effectively model these two processes and seamlessly integrate them poses significant implementation challenges for adaptive learning. As core learning resources, exercises have the potential to diagnose students' knowledge states during the learning processes and provide personalized learning recommendations to strengthen students' knowledge, thereby serving as a bridge to boost student-oriented adaptive learning. Therefore, we introduce a novel task called Knowledge-aware Exercise Generative Recommendation (KEGR). It aims to dynamically infer students' knowledge states from their past exercise responses and customizably generate new exercises. To achieve KEGR, we propose an adaptive multi-agent cooperation framework, called ExeGen, inspired by the excellent reasoning and generative capabilities of LLM-based AI agents. Specifically, ExeGen coordinates four specialized agents for supervision, knowledge state perception, exercise generation, and quality refinement through an adaptive loop workflow pipeline. More importantly, we devise two enhancement mechanisms in ExeGen: 1) A human-simulated knowledge perception mechanism mimics students' cognitive processes and generates interpretable knowledge state descriptions via demonstration-based In-Context Learning (ICL). In this mechanism, a dualmatching strategy is further designed to retrieve highly relevant demonstrations for reliable ICL reasoning.
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes, etc. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose AdaLRS, a plug-in-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search via optimizing loss descent velocities. We provide theoretical and experimental analyzes to show that foundation model pretraining loss and its descent velocity are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, base learning rate scheduler choices, and hyperparameter settings.
SKETCHMIND: AMulti-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches
Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation as either an image classification task or monolithic vision-language models, which lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SKETCHMIND, a cognitively grounded, multi-agent framework for evaluating and improving studentdrawn scientific sketches. SKETCHMIND introduces Sketch Reasoning Graphs (SRGs), semantic graph representations that embed domain concepts and Bloom's taxonomy-based cognitive labels. The system comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SKETCHMIND on a curated dataset of 3,575 student-generated sketches across six science assessment items with different highest order of Bloom's level that require students to draw models to explain phenomena. Compared to baseline GPT-4o performance without SRG(average accuracy: 55.6%), and with bSRGintegration achieves 77.1% average accuracy (+21.4% average absolute gain).
FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation
Residual connection has been extensively studied and widely applied at the model architecture level. However, its potential in the more challenging data-centric approaches remains unexplored. In this work, we introduce the concept of Data Residual Matching for the first time, leveraging data-level skip connections to facilitate data generation and mitigate data information vanishing. This approach maintains a balance between newly acquired knowledge through pixel space optimization and existing core local information identification within raw data modalities, specifically for the dataset distillation task. Furthermore, by incorporating training-time refinements, our method significantly improves computational efficiency, achieving superior performance while reducing training time and peak GPU memory usage by 50%. Consequently, the proposed method Fast and Accurate Data Residual Matching for Dataset Distillation (FADRM) establishes a new stateof-the-art, demonstrating substantial improvements over existing methods across multiple dataset benchmarks in both efficiency and effectiveness. For instance, with ResNet-18 as the student model and a 0.8% compression ratio on ImageNet-1K, the method achieves 48.4% test accuracy in single-model dataset distillation and 50.9% in multi-model dataset distillation, surpassing RDED by +6.4% and outperforming
DeepKD: ADeeply Decoupled and Denoised Knowledge Distillation Trainer
Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates duallevel decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-taskoriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness.
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWORLD-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset JEDI, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on JEDI demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWORLD-G. Furthermore, we demonstrate that improved grounding with JEDI directly enhances agentic capabilities of general foundation models on complex computer tasks with state-of-the-art performance, improving from 23% to 51% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces.
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing
We introduce ExRec, a general framework for personalized exercise recommendation with semantically-grounded knowledge tracing. Our method builds on the observation that existing exercise recommendation approaches simulate student performance via knowledge tracing (KT) but they often overlook two key aspects: (a) the semantic content of questions and (b) the sequential, structured progression of student learning. To address this, our ExRec presents an end-to-end pipeline, from annotating the KCs of questions and learning their semantic representations to training KT models and optimizing several reinforcement learning (RL) methods. Moreover, we improve standard Q-learning-based continuous RL methods via a tailored model-based value estimation (MVE) approach that directly leverages the components of KT model in estimating cumulative knowledge improvement.
Learning to Specialize: Joint Gating-Expert Training for Adaptive MoEs in Decentralized Settings
Mixture-of-Experts (MoEs) achieve scalability by dynamically activating subsets of their components. Yet, understanding how expertise emerges through joint training of gating mechanisms and experts remains incomplete, especially in scenarios without clear task partitions. Motivated by inference costs and data heterogeneity, we study how joint training of gating functions and experts can dynamically allocate domain-specific expertise across multiple underlying data distributions. As an outcome of our framework, we develop an instance tailored specifically to decentralized training scenarios, introducing Dynamically Decentralized Orchestration of MoEs or DDOME. DDOME leverages heterogeneity emerging from distributional shifts across decentralized data sources to specialize experts dynamically. By integrating a pretrained common expert to inform a gating function, DDOMEachieves personalized expert subset selection on-the-fly, facilitating just-in-time personalization.